Linear models are useful when the response (e.g. time) is a continuous quantity, e.g. 3.1415, -2.71, etc.
Generalized linear models (GLMs) are used when the response is not continuous:
Logistic regression is used when the response is binary (0/1 or True/False).
Poisson regression is used when the response is a non-negative integer, e.g., 0, 12, 1034, etc.
Suppose \(Y\) is a variable which counts something (e.g. fire incidents).
We might model the distribution of \(Y\) as a Poisson distribution \[Y \sim Poisson(\lambda)\]
The mean (average) value of \(Y\) is given by the free parameter \(\lambda\).
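As a quick reminder of this distributional fact, the sample mean of simulated Poisson draws is close to \(\lambda\) (a minimal base-R sketch):

```r
# The mean of a Poisson(lambda) sample is close to lambda
set.seed(1)
lambda <- 4.2
y <- rpois(1e5, lambda)
mean(y)  # approximately 4.2
```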
In Poisson regression, we model the mean, \(\lambda\), using a linear combination of our covariates.
Example: Let \(Y_1, \ldots, Y_{365}\) be the number of fire incidents in Scotland on each day in the year 2022.
A Poisson regression model would assume that \[Y_i \sim Poisson(\lambda_i)\]
The challenge is in specifying a suitable model for the means, \(\lambda_1, \ldots, \lambda_{365}\).
We could assume the mean number of incidents each day varies depending on the day of the week.
  day   weekday events
1   1    Monday    130
2   2   Tuesday    182
3   3 Wednesday    169
4   4  Thursday    170
5   5    Friday    217
6   6  Saturday    293
Let \(i=1, \ldots, 365\) indicate the day of the year.
A Poisson regression model assumes that the log of the mean relates linearly to the covariates. So for example, if \[Y_i \sim Poisson(\lambda_i)\] then we might model
\[\log \lambda_i = a_{day(i)}=\begin{cases} a_1& \mbox{ if day i is a Monday}\\
a_2 &\mbox{ if day i is a Tuesday}\\
\vdots\\
a_7 & \mbox{ if day i is a Sunday}\end{cases}\]
This model has 7 free parameters, \(a_1, a_2,\ldots, a_7\).
To fit a Poisson regression model in R, we just have to specify the linear formula for the log of the mean.
So in this case
\[\mathtt{events}\sim \mathtt{weekday-1}\]
The \(\mathtt{-1}\) removes the intercept, which gives us the correct parameterization: one parameter per day of the week, rather than an intercept plus six contrasts.
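The call that produces the output below is `glm(events ~ weekday - 1, family = poisson(), data = fire)`. A self-contained sketch (the real `fire` data frame is not reconstructed here, so we simulate a stand-in with the same columns):

```r
# Stand-in for the course's `fire` data: one row per day, columns `weekday` and `events`
set.seed(42)
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
fire <- data.frame(weekday = factor(rep(days, length.out = 364), levels = days))
fire$events <- rpois(364, exp(c(5.0, 5.05, 5.15, 5.14, 5.30, 5.71, 5.28))[as.integer(fire$weekday)])

# Poisson regression with one log-mean parameter per day of the week
fire.glm <- glm(events ~ weekday - 1, family = poisson(), data = fire)
coef(fire.glm)  # seven estimates, one per weekday
```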
Call: glm(formula = events ~ weekday - 1, family = poisson(), data = fire)
Coefficients:
weekdayFriday weekdayMonday weekdaySaturday weekdaySunday
5.298 4.996 5.710 5.275
weekdayThursday weekdayTuesday weekdayWednesday
5.142 5.052 5.146
Degrees of Freedom: 364 Total (i.e. Null); 357 Residual
Null Deviance: 600500
Residual Deviance: 395.7 AIC: 2982
This gives us the parameter estimates \(a_1, \ldots, a_7\).
Exercise: What happens if you instead use \(\mathtt{events}\sim \mathtt{weekday}\) as the model formula?
weekdayFriday weekdayMonday weekdaySaturday weekdaySunday
5.297740 4.996432 5.710491 5.275265
weekdayThursday weekdayTuesday weekdayWednesday
5.141776 5.052318 5.146488
What do these numbers mean?
Remember we are modelling the log of the mean \[\log \lambda_i = a_{day(i)}\]
If we look at the log of the mean number of events per day,
library(dplyr)
fire |> group_by(weekday) |> summarise(average = mean(events)) |> mutate(log = log(average))
# A tibble: 7 × 3
weekday average log
<chr> <dbl> <dbl>
1 Friday 200. 5.30
2 Monday 148. 5.00
3 Saturday 302. 5.71
4 Sunday 195. 5.28
5 Thursday 171. 5.14
6 Tuesday 156. 5.05
7 Wednesday 172. 5.15
we see that it matches the parameter estimates.
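This is no coincidence: with the `weekday - 1` parameterisation, the maximum-likelihood estimate of each coefficient is exactly the log of that group's sample mean. A quick check on toy data (simulated, with two "days" only):

```r
# The MLE of each group coefficient equals log(group sample mean)
set.seed(7)
toy <- data.frame(weekday = factor(rep(c("Mon", "Tue"), each = 50)))
toy$events <- rpois(100, ifelse(toy$weekday == "Mon", 10, 20))
fit <- glm(events ~ weekday - 1, family = poisson(), data = toy)
cbind(glm = coef(fit), log.mean = log(tapply(toy$events, toy$weekday, mean)))
```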
Consider a model that just distinguishes between weekdays and the weekend:
\[\log \lambda_i = \begin{cases} a &\mbox{ if day i is a week day}\\
b &\mbox{ if day i is a weekend day}\end{cases}
\]
Check that the parameter estimates are the logs of the mean number of events on weekdays and on weekend days.
Compare this model to the previous model:
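The `weekend` indicator used below can be built directly from the weekday factor. A self-contained sketch (with a simulated stand-in for the real `fire2`):

```r
# Stand-in for the real `fire2` (the course data is not reconstructed here)
set.seed(2)
days <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
fire2.sim <- data.frame(weekday = factor(rep(days, length.out = 364), levels = days))
fire2.sim$events <- rpois(364, ifelse(fire2.sim$weekday %in% c("Saturday","Sunday"), 250, 170))

# Weekend indicator: TRUE for Saturday and Sunday
fire2.sim$weekend <- fire2.sim$weekday %in% c("Saturday", "Sunday")

fit <- glm(events ~ weekend - 1, family = poisson(), data = fire2.sim)
exp(coef(fit))  # back-transformed: mean events on weekdays and on weekend days
```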
Call: glm(formula = events ~ weekend - 1, family = poisson(), data = fire2)
Coefficients:
weekendFALSE weekendTRUE
5.132 5.516
Degrees of Freedom: 364 Total (i.e. Null); 362 Residual
Null Deviance: 600500
Residual Deviance: 2066 AIC: 4643
# A tibble: 2 × 3
weekend average log
<lgl> <dbl> <dbl>
1 FALSE 169. 5.13
2 TRUE 249. 5.52
We can compare the models by splitting into test and training sets.
test.ind <- sample(1:364, 100) # 100 test points
fire.test = fire2 |> slice(test.ind)
fire.train = fire2 |> slice(-test.ind)
model1.train <- glm(events~weekday-1, family=poisson(), data=fire.train)
model2.train <- glm(events~weekend-1, family=poisson(), data=fire.train)
pred1 <- predict(model1.train, newdata=fire.test, type='response')
pred2 <- predict(model2.train, newdata=fire.test, type='response')
#MSE
sum((fire.test |> dplyr::select(events) - pred1)^2)/100
[1] 242.8843
sum((fire.test |> dplyr::select(events) - pred2)^2)/100
[1] 1251.816
So the model that assumes a different mean for each day of the week has a much smaller MSE than the model that only distinguishes weekend from week days.
Often in Poisson models we have to consider the exposure level for each observation.
Suppose we have data on the number of fire incidents, \(y_1, y_2, y_3\), in three Scottish towns of increasing size (and hence increasing numbers of properties).
Everything else being equal, we would expect \[y_1 < y_2 < y_3\]
We model the varying opportunity for incidents using an exposure variable.
We model the rate of incidents per some fundamental unit.
The choice of unit doesn’t change the fundamental model; we just need to be consistent.
If we make the fundamental unit incidents per property per year, then the exposure is the number of years times the number of properties in the town \[Exposure = \#years \times \#properties \]
Our model for the expected number of events per unit, \(y\), corresponding to covariate information \(x\) is
\[E(y|x) = \mbox{Exposure} \times {\rm e}^{a+bx_1+cx_2+\ldots}\]
If we take the logarithm of both sides the model becomes
\[\log E(y|x) = \log(Exposure) + a+bx_1+cx_2+\ldots \]
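In R the \(\log(Exposure)\) term is supplied via `offset()` in the model formula (its coefficient is fixed at 1 rather than estimated). A sketch on simulated data, where two towns share one underlying rate per property but have different exposures:

```r
# Two towns with the same rate per property but different numbers of properties
set.seed(3)
d <- data.frame(nppty = rep(c(5000, 25000), each = 100))
d$events <- rpois(200, d$nppty * exp(-7.4))  # true log rate per property is -7.4

fit <- glm(events ~ offset(log(nppty)), family = poisson(), data = d)
coef(fit)  # the intercept recovers the log rate, roughly -7.4
```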
Let’s download a new dataset which has data on the number of fire events in Paisley, St Andrews, and Stirling.
fire2 <- read.csv("https://rich-d-wilkinson.github.io/docs/fire2.csv")
# Order the days of the week in the usual way
fire2$weekday <- factor(fire2$weekday, levels=c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
fire2$town <- as.factor(fire2$town)
head(fire2, 10)
   X day weekday events    town nppty
1 1 1 Monday 10 Paisley 25231
2 2 2 Tuesday 6 Paisley 25231
3 3 3 Wednesday 8 Paisley 25231
4 4 4 Thursday 13 Paisley 25231
5 5 5 Friday 10 Paisley 25231
6 6 6 Saturday 14 Paisley 25231
7 7 7 Sunday 15 Paisley 25231
8 8 8 Monday 9 Paisley 25231
9 9 9 Tuesday 7 Paisley 25231
10 10 10 Wednesday 13 Paisley 25231
Exercise: What could you plot here to understand the data?
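One option (a sketch only, using a simulated stand-in `fire2.sim` so the snippet runs standalone; substitute the real `fire2` in practice): boxplots of daily counts by weekday, plus a table of group means like the one below.

```r
# Simulated stand-in for `fire2`: 52 weeks x 7 days x 3 towns
set.seed(5)
days  <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
towns <- c("Paisley","St Andrews","Stirling")
fire2.sim <- expand.grid(weekday = factor(days, levels = days), town = towns, wk = 1:52)
fire2.sim$events <- rpois(nrow(fire2.sim),
                          ifelse(fire2.sim$weekday %in% c("Saturday","Sunday"), 15, 10))

# Distribution of daily counts by day of the week
boxplot(events ~ weekday, data = fire2.sim, las = 2, ylab = "events")

# Mean events for each town/weekday combination
with(fire2.sim, tapply(events, list(town, weekday), mean))
```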
# A tibble: 21 × 3
# Groups: town [3]
town weekday `mean(events)`
<fct> <fct> <dbl>
1 Paisley Monday 10.4
2 Paisley Tuesday 11.6
3 Paisley Wednesday 11.6
4 Paisley Thursday 11.2
5 Paisley Friday 13.8
6 Paisley Saturday 20.9
7 Paisley Sunday 13.9
8 St Andrews Monday 5.42
9 St Andrews Tuesday 5.71
10 St Andrews Wednesday 5.81
11 St Andrews Thursday 6.02
12 St Andrews Friday 6.67
13 St Andrews Saturday 10.3
14 St Andrews Sunday 6.67
15 Stirling Monday 7.96
16 Stirling Tuesday 8.85
17 Stirling Wednesday 8.71
18 Stirling Thursday 9.06
19 Stirling Friday 10.7
20 Stirling Saturday 16.5
21 Stirling Sunday 10.8
We can see from these summaries that the number of incidents varies by day and by town.
The simplest sensible model would assume the town variation can be explained by the differing exposure (number of properties).
Call: glm(formula = events ~ offset(log(nppty)) + weekday - 1, family = poisson(link = log),
data = fire2)
Coefficients:
weekdayMonday weekdayTuesday weekdayWednesday weekdayThursday
-7.534 -7.442 -7.443 -7.435
weekdayFriday weekdaySaturday weekdaySunday
-7.265 -6.842 -7.259
Degrees of Freedom: 1092 Total (i.e. Null); 1085 Residual
Null Deviance: 32280000
Residual Deviance: 1987 AIC: 6419
A more complex model would be to assume the three towns have different rates even after accounting for exposure:
Call: glm(formula = events ~ offset(log(nppty)) + weekday * town, family = poisson(link = log),
data = fire2)
Coefficients:
(Intercept) weekdayTuesday
-7.789963 0.103148
weekdayWednesday weekdayThursday
0.103148 0.074503
weekdayFriday weekdaySaturday
0.279360 0.692226
weekdaySunday townSt Andrews
0.287682 0.768195
townStirling weekdayTuesday:townSt Andrews
0.370722 -0.051323
weekdayWednesday:townSt Andrews weekdayThursday:townSt Andrews
-0.034628 0.029794
weekdayFriday:townSt Andrews weekdaySaturday:townSt Andrews
-0.071943 -0.051866
weekdaySunday:townSt Andrews weekdayTuesday:townStirling
-0.080264 0.002212
weekdayWednesday:townStirling weekdayThursday:townStirling
-0.013122 0.054490
weekdayFriday:townStirling weekdaySaturday:townStirling
0.017339 0.035346
weekdaySunday:townStirling
0.014389
Degrees of Freedom: 1091 Total (i.e. Null); 1071 Residual
Null Deviance: 2598
Residual Deviance: 1110 AIC: 5570
This model includes interaction terms, i.e., each combination of town and day is given a different mean.
Which model is best?
test.ind <- sample(1:1092, 300) # 300 test points
fire.test = fire2 |> slice(test.ind)
fire.train = fire2 |> slice(-test.ind)
model1.train <- glm(events~offset(log(nppty))+weekday, family=poisson(link=log), data=fire.train)
model2.train <- glm(events~offset(log(nppty))+weekday*town, family=poisson(link=log), data=fire.train)
pred1 <- predict(model1.train, newdata=fire.test, type='response')
pred2 <- predict(model2.train, newdata=fire.test, type='response')
# MSE (both sums are divided by the same constant, so the comparison between models is unaffected)
sum((fire.test |> dplyr::select(events) - pred1)^2)/100
[1] 49.68993
sum((fire.test |> dplyr::select(events) - pred2)^2)/100
[1] 30.27288
So model 2 (with interactions) has a much smaller MSE than model 1.
In the CRIM, we model the number of incidents per day per property.
\[Exposure = \#days \times \#properties \]
Let \(y_{ijk}\) be the number of incidents recorded of risk level \(k\) in data zone \(i\) and ACORN category \(j\).
Let \(x_i\) be the standardised SIMD rank of datazone \(i\).
We considered models of the type
\[\log E(y_{ijk}|x_{i}) = \log(Exposure) + \mu + \alpha_j + \beta_k + \gamma_{jk} + \delta_{jk} x_i\]
which we can fit using commands such as
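The fitting command itself is not shown on this slide; with hypothetical column names (`acorn`, `risk`, `simd`, `exposure`, and `events` are all assumptions, since the CRIM data is not reproduced here), the formula structure corresponding to the model above would look like:

```r
# Hypothetical data: acorn category (j), risk level (k), standardised SIMD rank (x_i), exposure
set.seed(9)
crim <- data.frame(
  acorn    = factor(sample(1:3, 300, replace = TRUE)),
  risk     = factor(sample(1:2, 300, replace = TRUE)),
  simd     = rnorm(300),
  exposure = runif(300, 1000, 5000)
)
crim$events <- rpois(300, crim$exposure * exp(-7))

# acorn * risk gives mu + alpha_j + beta_k + gamma_jk;
# acorn:risk:simd gives the slopes delta_jk on the SIMD rank
fit <- glm(events ~ offset(log(exposure)) + acorn * risk + acorn:risk:simd,
           family = poisson(), data = crim)
```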
Do we need all the interactions?
Do we need an effect per datazone? \[\log E(y_{ijk}|x_{i}) = \log(Exposure) + \mu +a_i+ \alpha_j + \beta_k + \gamma_{jk} + \delta_{jk} x_i\] This requires random effects models.
Is there value in adding new covariates?
With all model development, it is important to remember the modelling pipeline:
flowchart LR
A[Plot data] --> C{Propose model}
C --> D[Fit model to data]
D --> E[Test + visualise model ]
E --> C
Simon Preston & Richard Wilkinson 31/1/2023